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We study the problem of estimating the ridges of a density function. Ridge 
estimation is an extension of mode finding and is useful for understanding the 
structure of a density. It can also be used to find hidden structure in point cloud 
data. We show that, under mild regularity conditions, the ridges of the kernel 
density estimator consistently estimate the ridges of the true density. When 
the data are noisy measurements of a manifold, we show that the ridges are 
close and topologically similar to the hidden manifold. To find the estimated 
ridges in practice, we adapt the modified mean-shift algorithm proposed by 
Ozertem and Erdogmus (2011). Some numerical experiments verify that the 
algorithm is fast and accurate. 
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Notation 

distribution of X G 
density of P; x G M^^ 

gradient of p 

Hessian of p and its spectral decomposition 
eigenvectors of H 

projector onto the local normal space 
projected gradient 

d-dimensional ridge of p where < d < D 

integral curve (path to ridge) 

integral curve parameterized by arclength s 

derivative along the curve 7(s) 

the set of all symmetric D x D matrices 

Frechet derivative of L : 5 — > 5 

Gaussian density, mean covariance a^I 

density of X when there is a hidden manifold M 
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1. Introduction. In many problems, multivariate data have some in- 
trinsic low dimensional structure. Examples are clusters, manifolds, inter- 
secting manifolds et cetera. See Figures 1 and 2. These features may show 
up in the density as modes, ridges and hyper-ridges. In this paper we study 
a method for finding these features. We consider two cases: 

Case 1: (The Unstructured Case): We observe a sample Xi, . . . , X„ from 
a density p. The goal is to find tlie ridges (and hyper-ridges) of p. 

Case 2: (The Hidden Manifold Case): We observe a sample Xi, . . . ,X„ 
from a density p. There is an underlying hidden manifold M and there is a 
model relating the manifold to the ridges of the density. The goal is to estimate 
the ridges and relate the ridges to M. We show that the ridge i? is a surrogate 
for M, meaning that, under appropriate conditions, R is close to M and has a 
similar topology as M. The manifold can only be estimated at a logarithmic 
rate (Genovese et al., 2012b) but the ridge can be estimated at a polynomial 
rate. 

Our goal is to provide a theoretical framework for estimating ridges. In par- 
ticular we show that the ridges of a kernel density estimator consistently 
estimate the ridges of the density and we find the rate of convergence. This 
leaves open the question of how to locate the ridges of the density estimator. 
Fortunately, this latter problem has recently been solved by Ozertem and 
Erdogmus (2011) who derived a practical algorithm called the subspace con- 
strained mean shift (SCMS) algorithm for locating the ridges. Ozertem and 
Erdogmus (2011) derived their method assuming that the underlying dens- 
ity function is known. We, instead, assume the density is estimated from a 
finite sample. Our work provides a statistical justification for, and extends, 
their algorithm. 

Mode Finding. Because ridges are a generalization of modes, we briefly re- 
view mode finding (Klemela, 2009; Li et al., 2007; Diimbgen and Walther, 
2008). Let p be a density on M^. Suppose that p has k modes mi, . . . , m^. 

An integral curve, or path of steepest ascent, is a path vr : M — )■ MP such 
that 



Under weak conditions, the paths vr partition the space and are disjoint 
except at the modes (Irwin, 1980; Chacon, 2012). 




d 
dt 



Tr{t) = Vp{TT{t)). 




Figure 2. The Cosmic web. The matter in the universe forms a web of clusters, filaments 
and sheets. 
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Figure 3. The Mean Shift Algorithm. The data points move along trajectories during 
iterations until they reach the two modes marked by the two large asterisks. 

The mean shift algorithm (Fukunaga and Hostetler (1975); Comaniciu and 
Meer (2002)) is a method for finding the modes of a density by following the 
steepest ascent paths. The algorithm starts with a mesh of points and then 
moves the points along gradient ascent trajectories towards local maxima. 
We describe the method in Section 7. A simple example is shown in Figure 3. 
Mode finding is similar to level set estimation (Polonik, 1995; Cadre, 2006; 
Walther, 1997) but in general, the two problems are not the same. 

Overview of the Results. Here we give a non-technical description of the 
main results. A ridge of a density function p is a low dimensional set 
(such as a curve in two dimensions) where the density is sharply peaked in 
one direction and smooth in the perpendicular direction as in Figure 4. 

1. Stability. In Theorem 6 we give a stability theorem for ridges. We show 
that if two functions are sufficiently close together then their ridges are 
also close together. 

2. Estimation for Case 1. In Theorem 7 we show that there is an estimator 
R such that 
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where Haus is the Hausdorff distance, defined in equation (8). Further, 
we show that R is topologicahy similar to R. We also construct an 
estimator R^ for h > that satisfies 



where Rh is a smoothed version of R. 
3. Estimation for Case 2. Suppose the data are obtained by sampling 
points on a manifold and adding noise with small variance cr^. In The- 
orem 9 we show that the resulting density p has a ridge R^ such that 



and Ra is topologically similar to M. Hence when the noise a is small, 
the ridge is close to M. It then follows that 



Related Work. In the special case when the ridge is zero dimensional, ridge 
finding reduces to mode estimation and SCMS reduces to the mean shift 
clustering algorithm (Fukunaga and Hostetler, 1975; Li et al., 2007; Chacon, 
2012). 

If the hidden structure is a manifold, then the process of finding the struc- 
ture is known as manifold estimation or manifold learning. There is a large 
literature on manifold estimation and related techniques. Some useful refer- 
ences are Niyogi et al. (2006) Caillerie et al. (2011), Genovese et al. (2012c, b, 
2009, 2012a) and references therein. Previous work on ridge finding includes 
Cheng et al. (2004), Hall et al. (2001), Wegman and Luo (2002) and HaU 
et al. (1992). 

More generally, there is a vast literature on hunting for structure in point 
clouds and analyzing the shapes of densities. Without attempting to be ex- 
haustive, some representative work includes Davenport et al. (2010); Klemela 
(2009); Adams et al. (2011); Chazal et al. (2011); Bendich et al. (2012). 

Throughout, we use symbols like C, Co, Ci, c, cq, ci . . . to denote generic pos- 
itive constants whose value may be different in different expressions. 



(3) 




(4) 
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Figure 4. The ridge R is the lower dimensional set where there is high local density. 

2. Model. For the unstructured case, we assume that Xi,. . . ,X„ G 
is a random sample from the distribution P with density p. Our goal is to 
find the ridges and hyper-ridges of p. A zero dimensional ridge is a mode. 
Ridges of dimension larger than zero are defined precisely in Section 4 but 
for now, the reader can think of a ridge as a low dimensional set where the 
density is locally, highly concentrated. Figure 4 shows a picture of a ridge. 

For the hidden manifold case, we assume that 

(6) P= (l-ry)Unif(/C) + ?7(VF*$<,) 

where 0<r/<l,VFisa distribution supported on a d-dimensional manifold 
M C )C where /C is a compact subset of with d < D, Unif(/C) is a uniform 
distribution on /C, <I>o- is a Gaussian distribution on with zero mean and 
covariance a I a, and * denotes convolution. 

Under model (6), the data generating process can be described in the fol- 
lowing steps: 

1. Draw B from a Bernoulli(r/). 

2. If S = 0, draw X from a uniform distribution on /C. 

3. If B = 1, let X = Z + ae where Z ^ W and e is additional noise. 

Points Xi drawn from Unif(/C) represent background clutter. Points Xi 
drawn from W -k can be thought of as noisy observations from M. If 
M consists of a finite set of points, this is a clustering model. 

Generally, M can only be estimated at a logarithmic rate (Genovese et al., 
2012b) while R can be estimated at a polynomial rate, as we show in this 
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.R 



h 



Figure 5. The outer circle denotes the manifold M . The dashed circle ts the ridge R of 
the density p. The ndge is a biased version of M and acts as a surrogate for M . The inner 
circle Rh shows the ndge from a density estimator with bandwidth h. R can be estimated 
at a much faster rate than M . 

paper. We will see that, under mild conditions, the ridges R of p can be 
thought of as surrogates for M. This means that R is close M and has a 
similar topological structure as M. We will make this idea precise in Section 
4. We can regard this surrogate i? as a biased — but estimable — version of 
M. See Figure 5. 

3. Technical Background. Before proceeding to a formal definition of 
ridges, we first review some background. We recommend that the reader 
quickly skim this section and then refer back to it as needed. 

3.1. Distance Function and Hausdorff Distance. We let B{x,r) = BD{x,r) 
denote a D-dimensional open ball centered at x with radius r. If A is a set 
and X is a point then we define the distance function 



where || • || is the Euclidean norm. Given two sets A and B, the Hausdorff 
distance between A and B is 



(8) Haus(yl, = inf<^ e : AcB®e and B C A® e\ = sup \dA{x)-dB{x)\ 



(7) 




X 
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where 

(9) Aee= \J BD{x,e) = : dA{x) <e^ 

is called the e-offset of A. The offset can be thought of as a smoothed version 
of A. For example, if there are any small holes in A, these will be filled in 
by forming the offset ^4 © e. 

We use Hausdorff distance to measure the distance between sets for several 
reasons: it is the most commonly used distance between sets, it is a very strict 
distance and is analogous to the familiar Lqo distance between functions for 
sets. 



3.2. Topological Concepts. This subsection follows Chazal, Cohen-Steiner 
and Lieutier (2009) Chazal et al. (2009) and Chazal and Lieutier (2005). 
The reach of a set K, denoted by reach{K), is the largest r > such that 
each point in K (B r has a unique projection onto K. A set with positive 
reach is, in a sense, a smooth set without self-intersections. 

Now we describe a generalization of reach called ^-reach. The key point 
is simply that the //-reach is weaker than reach. The full details can be 
found in the aforementioned references. Let ^ be a compact set. Following 
Chazal and Lieutier (2005) define the gradient Vyi(x) of dAix) to be the 
usual gradient function whenever this is well defined. However, there may 
be points x at which dA is not differentiable in the usual sense. In that case, 
define the gradient as follows. For x £ A define Va{x) = for all a; G A. 
For X ^ A, let T{x) = {y £ A : ||x — y|| = dA{x)}. Let Q{x) be the center 
of the unique smallest closed ball containing T{x). Define 

/-, ^\ „ , . X — @(x) 



The critical points are the points at which VAix) = 0. The weak feature size 
wfs(A) is the distance from A to its closest critical point. For < < 1, the 
fi-reach reach^(A) is 

(11) reach^(^) = mi{d : x{d) < fJ-} 

where x{d) = inf{| |VA(a:)| | : dA^x) = d}. It can be shown that reach^ is 
non-increasing in n, that wfs(^) = lim^_^o reach^(A) and that reach (A) = 
lim^^i reach^(A). 
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Figure 6. A straight line as infimte reach. A line mth a corner, as in this figure, has 
reach hut has positive fi-reach. 



As a simple example, a circle C with radius r has reach (i?) = r. However, if 
we bend the circle slightly to create a corner, the reach is but, provided the 
kink is not too extreme, the /i-reach is still positive. As another example, a 
straight line as infinite reach. Now suppose we add a corner as in Figure 6. 
This set has reach but has positive ^u-reach. 

Two maps f : A ^ B and g : A ^ B are homotopic if there exists a 
continuous map H : [0,1] x A ^ B such that ^^(0, x) = f{x) and H{1, x) = 
g{x). Two sets A and B are homotopy equivalent if there are continuous 
maps f : A ^ B and g : B ^ A such that the composition g o f is 
homotopic to the identity map on A and ii fog is homotopic to the identity 
map on B. In this case we write A = B. Sometimes A fails to be homotopic 
to B but A is homotopic to -B ® e for every sufficiently small e > 0. This 
happens because B (B e is slightly smoother than B. li A = B (B e for all 
small e > we will say that A and B are nearly homotopic and we will write 
A^B. 

The following result says that if a set K is smooth and K is close to K, then 
a smoothed version of K is nearly homotopy equivalent to K. 

Theorem 1 (Chazal and Lieutier 2005). Let K and K he compact sets 
and let e = Haus(K,K). // 

(12 e < ^ ^ ^ ' and ^ < a < reach„(i^ - 3e 

5^^ + 12 

then {K®a) 



3.3. Matrix Theory. We make extensive use of matrix theory as can be 
found in Stewart and Sun (1990), Bhatia (1997), Horn and Johnson (2013) 
and Magnus and Neudecker (1988). 

Let A be an m X n matrix. Then the Frobenius norm is ||^||f = \j^2ij,k ^"jk 
and the operator norm is \\A\\ = sup||^.||=;^ We define ||A||max = 
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maxj^fc It is well known that ||^|| < HAHj? < -y/n||A||, that 

< \Jmn\ \A\ Imax and that ^ \Jmn\ \A\ |max- 



The vec operator converts a matrix into a vector by stacking the columns. 
Thus, if A is m X n then vec(A) is a vector of length mn. Conversely, given a 
vector a of length mn, let [[a]] denote the mxn matrix obtained by stacking 
a columnwise into matrix form. 

If yl is m X n and B is px q then the Kronecker AfS>B is the mp x nq matrix 

AnB ■■■ AmB 



(13) 



AmlB • • • AmnB 



If A and B have the same dimensions, then the Hadamard product C = AoB 
is defined by Cjk = Aj].Bj].. 

For matrix calculus, we follow the conventions in Magnus and Neudecker 
(1988). If F : — )• is a vector- valued map then the Jacobian matrix 
will be denoted by F'{x) or dF/dx. This is the D x k matrix with F'{x)jk = 
dFi{x) / dxj. If F : — )• W^^p is a matrix-valued map then F'{x) is a 
mp X D matrix defined by 

If F : M"^'' — )- M™^P then the derivative is a mp x nq matrix given by 

dF dvec{F{X)) 



F'{X) 



dX dvec{X) 



We then have the following product rule for matrix calculus: if F : — t- 
j^mxp G:R^ W""! then 

^^^^^ = (G^ix) ^ I^)F'{x) + (/, ® F{x))G'{x). 

Also, if A{x) = f{x)I then A'{x) = vec(/) (V/(x))-^ where V/ denotes 
the gradient of /. 

Now we consider a special class of matrix valued functions: symmetric spec- 
tral functions (also called primary matrix functions); see section V.3 of Bha- 
tia (1997) and Chen et al. (2003). Let S be the set of symmetric D x D 
matrices. Let / : M — t- M. Let H ^ S have spectral decomposition H = 
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UMJ^ where A = diag(Ai, . . . , Ad) with Ai > A2 > • • • > \d- Define a 
matrix function /□ by 



(15) 



UH) = U 



/(Ai) 
/(A2) 














/(A 



D 



The function /□ : 5 — s- 5 is well-defined. It does not depend on the ordering 
of the eigenvalues or the choice of U in case of multiplicities of the eigen- 
values (Bhatia, 1997). Also, /□ is continuously differentiable if and only if 
/ is continuously differentiable. In general, /□ inherits whatever smooth- 
ness conditions (continuity, differentiability, etc) that / has; see Chen et al. 
(2003). 

We say that /□ is Frechet differentiable if there exists a linear map f^{H) 
such that 

(16) fn{H + E)- fu{H) - f^^{H)E = oi\\E\\F). 
The Frechet derivative f^{H) is given by (Bhatia (1997)) 

(17) f^{H)E = u(^f^o{U^EU)^U'^ ior all E£S 

where o denotes the Hadamard product and Is a D x D matrix with 
entries 



(18) 



f*i 




The following version of the Davis-Kahan Theorem is from Von Luxburg 
(2007). Let H and H be two symmetric, square D x D matrices. Let A be 
the eigenvalues of H. Let 5 C M and let V (V) be the matrix whose columns 
are the eigenvectors corresponding to the eigenvalues of H (H) in S. Let 

(19) /3 = min||A-s| : AG Ap|5^ sG^}. 

According to the Davis-Kahan theorem. 



(20) 



\vv' 



vv^W < 



\H- H\ 



14 GENOVESE, PERONE-PACIFICO, VERDINELLI, WASSERMAN 



Let H he a DxD square, symmetric matrix with eigenvalues Ai > • • • > A^^. 
Let H be another square, symmetric matrix with eigenvalues Ai > • • • > A^^. 
By Weyl's theorem (Theorem 4.3.1 of Horn and Johnson (2013)) we have 
that 

(21) Xn{H -H) + \i{H) < \i{H) < \{H) + X^{H-H). 
It follows easily that 

(22) \XiiH) - XiiH)\ < \\H-H\\<D\\H-H\\^^. 

4. Ridges. In this section we provide a mathematical framework for ridges. 
Specifically, we do the following: 

1. We give a formal definition for ridges. 

2. We establish some basic properties of ridges. 

3. We show that, under appropriate conditions, if two functions are close 
together then their ridges are close and their ridges are topologically 
similar. 

As in Ozertem and Erdogmus (2011), our definition of ridges relies on the 
gradient and Hessian of the density function p. 

4.1. Definitions. Given a function p : — t- M, let g(x) = Vp(x) denote 
the gradient at x and let H{x) denote the Hessian matrix. Let 

(23) Ai(a;) > A2(a;) > ••• > AD(a;) 

denote the eigenvalues of H{x) and let A(x) be the diagonal matrix whose 
diagonal elements are the eigenvalues. Write the spectral decomposition of 
H{x) as H{x) = U{x)A{x)U{xf. Fix < d < D and let V{x) be the last 
D — d columns of U{x) (that is, the columns corresponding to the D — d 
smallest eigenvalues). If we write U{x) = [V^(x) : l^(x)] then we can write 
H{x) = [Vo{x) : V{x)]K{x)[V^{x) : V{x)Y. Let L{x) = V{x)V{x)'^ be the 
projector on the linear space defined by the columns of V{x). Define the 
projected gradient 

(24) G{x) = L{x)g{x). 

If the vector field G{x) is Lipschitz then by Theorem 3.39 of Irwin (1980), 
G defines a global flow as follows. The flow is a family of functions (pix, t) 
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such that (p{x, 0) = x and (l)'{x, 0) = G{x) and (p{s, (j){t, x)) = + x). The 
flow hnes, or integral curves, partition the space and at each x where G{x) 
is non-null, there is a unique integral curve passing through x. The intuition 
is that the flow passing through re is a gradient ascent path moving towards 
higher values of p. Unlike the paths defined by the gradient g which move 
towards modes, the paths define by G move towards ridges. 

The paths can be parameterized in many ways. One commonly used para- 
meterization is to use t G [— oo, oo] where large values of t correspond to 
higher values of p. In this case t = oo will correspond to a point on the 
ridge. In this parameterization we can express each integral curve in the 
flow as follows. A map tt : M — >■ M.^ is an integral curve with respect to the 
flow of G if 



Definition: The ridge R consists of the destinations of the integral curves: 
y G i? if limt_j.oo 7r(t) = y for some vr satisfying (25). 

As mentioned above, the integral curves partition the space and for each 
x ^ R, there is a unique path ttx passing through x. The ridge points are 
zeros of the projected gradient: y ^ R implies that G{y) = (0, . . . , 0)-^. 

It will also be convenient to parameterize the curves by arclength. Thus, let 
s = s{t) be the arclength from 7r(t) to 7r(oo): 



(25) 



vr'(t) = G(vr(t)) = L(vr(t))5(vr(t)). 



(26) 




Let t = t(s) denote the inverse of s{t). Note that 



(27) t'is) 



1 



1 



1 



7r'(t(.))|| 



L{7r{t{s)))g{7r{t{s)m 



l|G(7r(t(.)))|r 



Let 7(s) 



7r(t(s)). Then 



(28) 




which is a restatement of (25) in the arclength parameterization. 



In what follows, we will often abbreviate notation by using the subscript s 
in the following way: Gs = G{'~f{s)), Hs = H('y{s)), . . ., etc. 



16 GENOVESE, PERONE-PACIFICO, VERDINELLI, WASSERMAN 



4.2. Differentials. We will need derivatives of g, H, and L. The derivative 
of g is the Hessian H. Recall from (14) that H'{x) = ^^^^^^^f^^- We also 
need derivatives along the curve 7. The derivative of a functions / along 7 
is 

(29) f^f^s) = fs = hni . 

Thus, the derivative of the gradient g along 7 is 

.^.^ • _• V g(7(^ + e))-g(7(^)) ^ , HsGs 

(30) 5^(3) = 55 = hm = Hs-fs = - 777777 ■ 



We will also need the derivative of H in the direction of a vector z which 
we will denote by 



H'(x; z) = lim 



H{x + €z) - H{x) 



We can write an explicitly formula for H'{x;z) as follows. Note that the 
elements of H' are simply the partial derivatives dHjk{x) / dxi arranged in 
a X D matrix. Hence, H'{x;z) = [[H'{x)z]]. (Recall that [[a]] stacks a 
vector into a matrix.) 

The collection {L{x) : x G R^} defines a matrix field: there is a matrix 
L{x) attached to each point x. We will need the derivative of this field along 
the integral curves 7. For any x ^ R, there is a unique path 7 and unique 
s > such that x = 7(5). Define 



(31) 



L{x) 



lim 



L{H{^{s + e)))-LiHijis))) 



To compute the direction derivative Lg, we first need the Frechet derivative 
of L when L = L{H) is viewed as a map L : 5 — )• 5 where S is the set 
of symmetric D x D matrices. Let 61 < 62 be real numbers with 61 < and 
(3 = h2 — hi > 0. Let denote all D x D symmetric matrices such that 
1, . . . , d and A,- <hi for j = d + 1, . . . , D. 



\j > 1)2 for j 



Lemma 2. The Frechet derivative of L on Sr satisfi' 



les, 



(32) L^E = U{ 





-B 




^D-d _ 



{U^EU) for each EgS, 
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where Ofc denotes a k x k matrix of zeroes, B is a d x {D — d) matrix with 
entries 

(33) Bjk = -T-r-. ^ ^ 



, ^<3<d., 1 < k < D - d. 



Proof. For H e 6/3 with H = UAU'^ write U = [V<, : V] and let L = 
L{H) = VV'^ which is the projector onto to the space spanned by the 
eigenvectors corresponding to A^^+i, . . . , A^j. 

Now take / to be any monotone, differentiable function such that /(x) = 1 
for X < hi and f{x) = for x > 62- It follows that, for each H £ Sjs, 
L{H) = fn{H). Then 






-B ' 








(34) /. 
where /* was defined in (18). Hence, from (17), for all G 5, 

(35) L^E = U\ 






-B 








o (U^EU) ] U^. 



□ 



Now we can find Lg. Recall that H'{x;z) = [[H'z]] denotes the direction 
derivative of H in the direction z. 

Theorem 3. Let x = 7(5) and write H = Hg = UMJ'^ and H' = H'{x). 
Suppose that H{y) G Sp for all y in an open neighborhood around x. We 
have that 



(36) Ls{x) 



1 



Odxd 


-B 




0{D-d)x{D-d) 



o{U'''[[H'Gs]]U) )C/^. 



Proof. Recah that 7^ = Let E = [[H'-f'^]] = -[[H'Gs]]/\\Gs 

Note that 

7(s + e) = 7(s) + ej'{s) + o(e) = 7(5) - eGs/\\Gs\\ + o(e) 

and 

Hs+e = H{jis) + ej'is)) + 0(e)) = H + eH'ijisy,j'{s)) + o(e) 
= H + e[[H'^'{s)]] + o(e) = H + eE + o(e). 
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Hence, 



where is the Frechet derivative of L. As we showed earher, the Prechet 
derivative , satisfies the following equation for any E e S, 






-B ' 








Hence, with E = -[[H'Gs]]/\\Gs\\, 



L, = L^E 






-B ' 








o {U^EU)^ U^. 



□ 



4.3. Assumptions. We now record the main assumptions that we will re- 
quire for the results. 

Assumption (AO). For all x, g{x), H{x) and H'{x) exist. 

Assumption (Al). There exists /3 > 0, (5 > and h\ < 62 such that, 
61 < 0, /9 = 62 - 61, and for a\\ x e R® 5, 

(Al) Aj(a;) > 62, i = 1, ■ ■ • , d and Aj(x) < 61, j = d+l,. . . ,D. 

Assumption (A2). For each x G i? © 5, 
(A2) \\g{x)\\ \\H'{x)\U^< 



2VD 



Condition (Al) says that p is sharply curved around the ridge in the D — d 

dimensional space normal to the ridge. It is akin to requiring a function 
to have a negative second derivative at a mode. (A2) is a third derivative 
condition which implies that the paths cannot be too wiggly. 



Lemma 4. Conditions (A0)-(A2) imply that, for each x e {R ® S) — R, 
there is a unique path 7 passing through x. 
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Proof. We will show that the vector field G{x) is Lipschitz over R(BS. The 
result then follows from Theorem 3.39 of Irwin (1980). Recall that G = Lg 
and g is differentiable. It suffices to show that L is differentiable over R(BS. 
Now L{x) = L{H{x)). The Frechet derivative of L as a function of H was 
given in Lemma 2. And H is differentiable by assumption. By the chain rule, 
L is differentiable as a function of x. Indeed, dL/dx is the x D matrix 
whose j^^ column is yec{L^Ej) where Ej = [[H'ej]] and ej is the vector 
which is 1 in the j^^ coordinate and zero otherwise. □ 

4.4. Quadratic Behavior. Conditions (Al) and (A2) imply that the func- 
tion p has quadratic-like behavior near the ridges. This property is needed 
for establishing the convergence of ridge estimators. In this section we form- 
alize this notion of quadratic behavior. Give a path 7, define the function 



Thus ^ is simply the drop in the function p along the curve 7 as we move away 
from the ridge. We write S,x{s) if we want to emphasize that corresponds 
to the path 'jx passing through the point x. Since ^ : [0,oo) — )• [0,oo), we 
define its derivatives in the usual way, i.e. ^'(s) = dS,{s)/ds. 

Lemma 5. Suppose that (A0)-(A2) hold. For all x £ R® 6, the following 
are true: 



(37) 



C{s) = p(^(oo)) - p{TT{t{s))) = p(7(0)) - p{-t{s)). 



1. e(0) = 0. 

2- e{s) = \\Gi^{sm and e{0) = 0. 
3. The second derivative of is: 



(38) 




Gg HgG, 
||G,||2 



s 



+ 



\\Gs\\ 




Proof. 1. The first condition ^(0) = is immediate from the definition. 
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2. Next, 

^ (s) = J- — = -9sls - 



ds \\Gs\\ \\Gs 

\\Gs\\. 



9^ LgLsQs _ GJGs 



\\Gs\\ \\Gs 

Since the projected gradient is at the ridge, we have that ^'(0) = 0. 

3. Note that (^'(s))^ = ||G,||2 = G^Gs = g^L^Qs = a{s). Differentiating 
both sides of this equation we have that 2^'(s)^"(s) = a'{s) and hence 



2e(s) 2||G,||- 
Now 

(39) a'(s) = {gsflggs + g^Lggs + g^Lggs = 2{gsfLsgs + gjiggg. 
Since LgLg = Lg we have that Lg = LgLg + LgLg and hence 

g^Lsgs = glLgLsgs + gJigLggs = G^Lsgs + gjLgGs = 2gjLsGs. 
Therefore, 

(40) a'{s) = 2{ggfLsgs + 2gjL,Gs. 
Recall that gs = Thus, 



2||G.|| ||G,||2 ||G, 



4. The first term in £,"{s) is — • Since G is in the column space of 

V, G^H.Gs = Gj{VsA,V,^)Gs where A, = diag(Ad+i(7(s)), . . . , Xd{i{s))). 
Hence, from (Al), 

and thus 



\\G 



^ 2 
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Now we bound the second term. The second term is 

o{U^[[H'Gs]]U)^U^'^Gs 

[[H'A,]]U)ju^^A, 

where Ag = Gs/\\Gs\\- Ag is a unit vector in the column space of V. Now, 
the II 

■ I |max| |"iiom^ of the term inside the braces {•} is bounded above by 
VD\\H' Winax/ P- Here we used the fact that, due to (Al), \Bjk\ < for 
each j,k. Hence, from (A2), 

\gjLsGs\ ^ VD\\H'\\ra^ \gjAs\ ^ VD\\H'\\^^ ^ f3_ 
\\Gs\\ - P - P - 2' 



WG.Al 



IG. 



12 





-B 




^D-d . 



9i{U 





-B 


-B'' 


^D-d . 



5. Follows from 2. 

6. For some < s < s, 

from part (4). So 

C(^) -?(0)>^.^> ^||7(0)- 7(^)11'- 

□ 



4.5. Stability of Ridges. Wc now show that if two functions p and p are 
close, then their corresponding ridges R and R are close. We use g,H, . . . 
etc to refer to the gradient, Hessian and so on, defined by p. For any function 
/ : ^ M let 

sup \fix)\. 



Let 



x&R®5 



(42) e=\\p-p\\oo e' = max||5j - £/j||oo 

(43) e" = max \\Hjk - Hjk\\oo e'" = max \\H'jf.- H'j^ \ \ oo 
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Theorem 6. Suppose that (A0)-(A2) hold for p and that (AO) holds for p. 
Let = max{e, e', e"} and let = max{e, e', e", e'"}. When ^ is sufficiently 
small: 

(1) Conditions (Al) and (A2) holds for p. 

(2) We have: 

2Cip 



(44) Haus(i?,ii:) < 



(3) If reach ^{R) > for some /U > 0, then 
(45) Re^?^R. 



Proof. (1) Write the spectral decompositions H = UAU'^ and H = UAU'^. 
By (22), 

|A,--A,-| <D\\H-H\U^<De". 

Thus, p satisfies (Al) when e" is small enough. Clearly, (A2) also holds as 
long as ^ is small enough. 

(2) By the Davis-Kahan theorem, 

llr 7|| < \\H-H\\f ^ D\\H-H\U, ^ De^ 
II II- /3 - /3 - 13 ■ 

For each x, 

\\Gix)-G{x)\\ = \\Lix)g{x)-L{x)g{x)\\ 

'Lix)-L{x))g{x)\ \ + \ \L{x)(g{x) - gix)^ 



< 



^ DMx)\\e" 

It follows that, \\L-L\\< Ce" and sup^ \\G{x) - G{x)\\ < Ct/;. 

Now let X £ R. Thus ||G(x)|| = and hence ||G(x)|| < C^j. Let 7 be the 
path through x so that 7(5) = x for some s. Let r = 7(0) € R. From part 2 
of Lemma 5, note that (,'{s) = ||G(x)||. We have 

G^l^>\\G{x)\\=eis)=eiO) + se'iu) 
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for some u between s and 0. Since ^'(0) = 0, from part 4 of Lemma 5, 

and so 

2CV' 

||r — x|| < s < — -— . 
Thus, d{x,R) <\\r-x\\ < 2CVV/3- 

Now let X £ R. The same argument shows that d{x, R) < 2C^j/ (3 since (Al) 
and (A2) hold for p. 

2 

(3) Choose any fixed k > such that k < ^^2^12 ■ When is sufficiently 
small, ^ < K reachu (K). Then i?© ^ ^R fr om Theorem 1. □ 



5. Ridges of Density Estimators. Now we consider estimating the 
ridges in the unstructured case. Let Xi,. . . Xn ~ P where P has density 
p and let 

1 " 1 

(46) i^^(-) = -E/^^ 

i=l 

be a kernel density estimator with kernel K and bandwidth h. Let R be the 
ridge defined by p. In this section we bound Haus(i?, R). We assume that P 
is supported on a compact set C C M^. 

We assume that all derivatives of p up to and including fifth degree are 
bounded and continuous. We also assume the conditions on the kernel in 
Gine and Guillou (2002) which are satisfied by all the usual kernels. Results 
on \\p{x) — Phix)\\oo are given, for example, in Prakasa Rao (1983), Gine 
and Guillou (2002) and Yukich (1985). The results in those references imply 
that 

e = sup\\p{x) -p{x)\\oo = 0{h?) + Op 

For the derivatives, rates are proved in the sense of mean squared error by 
Chacon et al. (2011). They can be proved in the norm using the same 
techniques as in Prakasa Rao (1983), Cine and Guillou (2002) and Yukich 



Xi 
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(1985). The rates are: 



e' = maxsup||5j(x) - ?j(x)||oo = 0{li^) + Op ( ^ 



Let 

(47) A, 



2 

n 



logn 



e" = maxsup ||Fj,fc(x) - Hj^k{x)\\oo = 0{h'^) + Op , 
e'" = rnaxsup ||F;.,(x) - H'^^,{x)\\^ = Oih') + Op N 



Choosing h x -v/0n we get that 

e X e' X e" X e'" x Op(V'n)- 
From Theorem 6 and the rates above we have the fohowing. 

Theorem 7. Let R* = R D {R (B S). Under the assumptions above and 
assuming that (Al) and (A2) hold, we have, with h x that 

(48) Haus(i?,^*) = Op(V'n)- 

// reach^(i?) > then R* O(V'n) 



Let Ph{x) = IE(j5^(x)) and let Rh be the ridge set of p^^. It may suffice for 
practical purposes to estimate Rh for some small /i > 0. The reason is that 
Rh will be topologically very similar to R but is much easier to estimate. 
This point is discussed in more detail in the next section. In this case we 
can take h fixed rather than letting it tend to 0. For fixed h we then have 
dimension independent rates: 

Theorem 8. Let h > be fixed and let ipn = y^logn/n. Let R* = Rn{RQ 
6). Under the assumptions above and assuming that (Al) and (A 2) hold for 
Rh we have, that 

(49) Haus{Rh,R*) = Op{i^n)- 

If reacU^{Rh) > then R* 0(V^„) 
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6. Ridges as Surrogates for Hidden Manifolds. Consider now the 
case where = {I — ?7)Unif(/C) + r/(PF ★ where W is supported on 

M. We assume that W has a twice-differentiable density w respect to the 
uniform measure on M. We also assume that w is bounded away from zero 
and DO. 

We want to show that the ridge of is a surrogate for M. SpecificaUy we 
show that, as a gets smah, there is a subset c .R in a neighborhood 
of M such that Haus(M,i2*) = 0{a^log^{l/a)) and such that i?* %M. 
We assume that 77 = 1 in what fohows; the extension to < r/ < 1 is 
straightforward. We assTimc that AI is a compact (i-manifold with positive 
reach k. We need to assume that M has positive reacli rather than just 
positive /x-reach. The reason is that, when M has positive reach, the measure 
W induces a smooth distribution on the tangent space T^M for each x G M. 
We need this property in our proofs and this property is lost if M only has 
positive //-reach for some n < 1 due to the presence of unsmooth features 
such as corners. 

The density of X is 

(50) p^{x) = I Mx - z)dWiz) 

Jm 

where (paiu) = (27r)^^/^(7~-^ exp ^^^^ • Thus, is a mixture of Gaus- 
sians. However, it is a rather unusual mixture; it is a singular mixture of 
Gaussians since the mixing distribution W is supported on a lower dimen- 
sional manifold. 

Let TxM be the tangent space to M at x and let T^M be the normal 
space to M at x. Define the fiber at x G M by F-x = T^M Pi Bj:i{x,r). A 
consequence of the fact that the reach k is positive is that, for any < r < k, 
M r can be written as a disjoint union 

(51) M®r= [j F^. 
Let To- > satisfy the following conditions: 

(52) r<j < (7, ^^00, ra log ( ^^^^ ) = o(l) as a ^ 0. 



Specifically, take = aa for some < a < L Fix any A>2 and define 



(53) iC^ = W 2(72 log 
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Theorem 9 (Surrogate Theorem). Suppose that k = reach(M) > 0. 

Let be the ridge set of Pa- Let M^- = M ® r^- and R% = i?a-P|Mo-. For 
all small cr > 0; 

1. R% satisfies (Al) and (A2) with (3 = ccr~(^~'^+^) form some c > 0. 

2. Haus(M,i?*) = 0{K^). 

3. R% e CKl %M. 

If R(T is instead taken to he the ridge set of log Pa then the same results are 
true with (3 = ca~'^ and = M (B k. 

The theorem shows that in a neighborhood of the manifold, there is a weh- 
defined ridge, that the ridge is close to the manifold and is nearly homo- 
topic to the manifold. It is interesting to compare the above result to recent 
work on finite mixtures of Gaussians (Carreira-Perpinan and Williams, 2003; 
Edelsbrunner et al., 2012). In those papers, it is shown that there can be 
fewer or more modes than the number of Gaussian components in a finite 
mixture. However, for small cr, it is easy to see that for each component of 
the mixture, there is a nearby mode. Moreover, the density will be highly 
curved at those modes. Theorem 9 can be thought of as a version of the 
latter two facts for the case of manifold mixtures. 

The theorem refers to the ridges defined by pa and the ridges defined by 
log Pa- Although the location of the ridge sets is the same for both cases, 
the behavior of the function around the ridges is different. There are several 
reasons we might want to use logp rather than p. First, when p is Gaussian, 
the ridges of logp correspond to the usual principal components. Second, the 
surrogate theorem holds in an 0(1) neighborhood of M for the log-density 
whereas it only holds in an 0(5) neighborhood of M for the density where 



To prove the theorem we need a preliminary result. In what follows, if A is 
a matrix, then an expression of the form A + 0(r„) is to be interpreted to 
mean A + Bn where -B„ is a matrix whose entries are of order 0(r„). Let 



(54) 




(55) 



e-IHlV(2a2) 



n G M 



D-d 



(27r) — CT^-'^ 
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Lemma 10. For all x G M„, 



1. p„{x) = ^^{x-x){l + 0{a)). 

2. Letpa,B{x) = Jj^j^g(j)aix-z)dW{Z). Thenpa,B{x) = </'±(x-a;) (1 + 0{a)). 

3. g^x) = -^p.(x) {{x -x) + 0{Kl)) and \\g.{x)\\ = 0(c7-(^-'^-i)). 
4- The eigenvalues of Ho-{x) are 



(56) Xj{x) = 
5. The projection matrix L^j satisfies 



0{a 

Per' 

a 



1 - + o(5) 



i < d 

i = d + l 



Lr,(x) 



Od Od,D-d 



+ 0{a). 



6. Projected gradient: 



GJx) 



1 



(x - x)(t>^{x - x)(l + 0(5)) + 0^{Kl) 



where 0±{K^) is a term of size 0{K^) in T^ . 
7. Gap: 



\d{x) - \d+l{x) > 
and \d+i{x) < —f3. 

8. ||//;|Uax = 0(^-(^+3-'^)). 



[l-a^ + 0{a) 



Proof. The proof is quite long and technical and so we relegate it to the 
Appendix. □ 



Proof of Theorem 9. Let us begin with the ridge based on p^^. 

(1) Condition (Al) follows from parts 8 and 1 of Lemma 10 together with 
equation (66). 



To verify (A2) we use parts 3 and 8 of Lemma 10: we get that 



\H'\ 



< 



1 



< 



^D-d-l ^D+3-d 2/Dcr2(i5-'^+2) 2/D 
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as required. 

(2) Suppose that x £ R*. Then = 0. Let x be the unique projection 
of X onto M. From part 6 of Lemma 10 

||(x - x)Mx - x)(l + 0{a)) + 0±{K^)\\ = 

and hence x = x + 0{K^). 

Now let X G M. From the expression above, we see that ||GCT(a;)|| = 0{K^). 
Let 7 be the path through x and let r be the destination of the path. Hence 
7(5) = X for some s and 7(0) = r. Now we use Lemma 5. Then ||G|| = ^' 
and 

o{Kj) = e{s) = c'is) - e'(o) = sec^ >\\x- file" (5) >\\x- fii/3/2 

and so ||x-f|| =0{Kl). Hence, Haus(i?^,M) = 0{Kl). 

(3) Homotopy. This follows from part (2) and Theorem 1. 

Now consider the ridges of \ogpa{x). The proof is essentially the same as 
the proof above. The main difference is the Hessian as we now explain. Note 
that the Hessian H* for \ogpa{x) is 

H*Ax) = -)-r [Haix) ]—ga{x)g^{x)] . 

From Lemma 10, parts 3 and 4, it follows that (after an appropriate rota- 
tion), 

K{x) = -j,l^ 

Hence, 

Xd+i{x) = — ^ + 0(a) 

and ^ 

Arf+i(x) - Xd{x) = ^ + 0(a). 

Notice in particular, that the dominant term of the smallest eigenvalue of 
—f3H*{x) is 1 whereas that the dominant term of the smallest eigenvalue of 
—j3Hfj{x) is 1 d\,i{x)/a'^ which is why we required ||x — a;|| to be less than 
a in Theorem 9. Here we only require that ||x — x|| < k. □ 

We may now combine Theorems 6, 7, 8 and 9 to get the following. 



Od OdxD-d 
OD-dxd I D-d 



+ 0{a: 
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Corollary 11. Let R* be defined as in Theorem 7. Then 



(57) 



Hau5{R*,M) 



Op 



( 



logn\ 0+8 



) 



+ 0{Kl). 



Similarly, if R* be defined as 



in Theorem 8 then 



(58) 



Haus(i2*,M) 



Op 




log n 



n 



) 



+ 0{Kl + h^). 



7. Subspace Constrained Mean Shift. To find the ridges of a kernel 
density estimator we use the subspace constrained mean shift (SCMS) al- 
gorithm due to Ozertem and Erdogmus (2011). Let us begin by reviewing 
the mean shift algorithm. 

The mean shift algorithm (Fukunaga and Hostetler (1975); Comaniciu and 
Meer (2002)) is a method for finding the modes of a density by approxim- 
ating the steepest ascent paths. The algorithm starts with a mesh of points 
and then moves the points along gradient ascent trajectories towards local 
maxima. 

Given a sample Xi , . . . , Xn from p, consider the kernel density estimator 



where K is a kernel and /i > is a bandwidth. Let M. = {ui, . . . , Vm} be a 
collection of mesh points. These are often taken to be the same as the data 
but in general they need not be. Let Vj{l) = vj and for t = 1, 2, 3, ... we 
define the trajectory Vj{l),Vj{2), . . . , by 



It can be shown that each trajectory {vj{t) : t = 1,2,3, ... ,} follows the 
gradient ascent path and converges to a mode of ph ■ Conversely, if the mesh 
M is rich enough, then for each mode ph, some trajectory will converge 
to that mode. See Figure 3. 



(59) 




i=l 



(60) 
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Subspace Constrained Mean Shift Algorithm 



1. Input: q{x) — \ogp{x) and a mesh of points AI = {mi, . . . , m]v}. Tolerance e. For 
example, e = 10~^. 

2. Let ^(a;) and H{x) be the gradient and Hessian of q. 

3. Let V = [vi ■ ■ ■ vd^] be the D x {D — d) matrix whose columns are the eigenvectors 
of H{x) corresponding to the smallest D — d eigenvalues. 

4. For each mesh point m repeat: 



where a = h-'^ K{\\m - Xi\\/h). 
5. Stop whem |ff^(m)Vl/^5(m)|/(||5(m)|| • HV^^ffMH) < e. 



Figure 7. The SCMS algorithm of Ozertem and Erdogmus (2011). 

The SCMS algorithm mimics the mean shift algorithm but it replaces the 
gradient with the projected gradient at each step. The algorithm can be 
applied to p or any monotone function of p. As we explained earlier, there 
are some advantages to using log p. Figure 7 gives the algorithm for the 
log-density. This is the version we will use in our examples. 

The SCMS algorithm provides a numerical approximation to the paths 7 
defined by the projected gradient. We illustrate the numerical algorithm in 
Section 8. 

8. Implementation and Examples. Here we demonstrate ridge estim- 
ation using SCMS in some two-dimensional examples. In each case, we will 
find the d = 1 one dimensional ridge set. Our purpose is to show proof of 
concept; there are many interesting implementation details that we won't 
address here. In each case we use a modified SCMS algorithm in which 
we first filter the data by throwing away low density points. To do so, we 
threshold the density estimator. 

To implement the method requires that we choose a bandwidth h for the 
kernel density estimator. There has been recent work on bandwidth selec- 
tion for multivariate density estimators such as Chacon and Duong (2010), 
Chacon and Duong (2012) and Panaretos and Konis (2012). For the purposes 
of this paper, we simply use the Silverman rule (Scott, 1992). 




Figures 8 through 10 show two examples of the modified SCMS algorithm. 
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In the first example, the manifold is a circle. Although the circle example 
may seem easy, we remind the reader that no existing statistical algorithms 
that we are aware of can, without prior assumptions, take a point cloud as 
input and find a circle, automatically. 

The second example is a stylized "cosmic web" of intersecting line segments 
and with random background clutter. This is a difficult case that violates the 
assumptions; specifically the underlying object does not have positive reach. 
The starting points for the SCMS algorithm are a subset of the grid points 
at which a kernel density estimator is evaluated. We select those points for 
which the estimated density is above a threshold relative to the maximum 
value. 

Figure 10 shows the estimator for four bandwidths. This shows an interesting 
phenomenon. When the bandwidth h is the large, the estimator is biased (as 
expected) but it is still homotopy equivalent to the true M. However, when h 
gets too small, we see a phase transition where the estimator falls apart and 
degenerates into small pieces. This suggests it is safer to oversmooth and 
have a small amount of bias. The dangers of undersmoothing are greater 
than the dangers of oversmoothing. 

The theory in Section 6 required the underlying structure to have positive 
reach which rules out intersections and corners. To see how the method fares 
when these assumptions are violated, see Figure 11. While the estimator is 
far from perfect, given the complexity of the example, the procedure does 
surprisingly well. 

9. Conclusion. We presented an analysis of nonparametric ridge estima- 
tion. Our analysis had two main components: conditions that guarantee that 
the estimated ridge converges to the true ridge, and conditions to relate the 
ridge to an underlying hidden manifold. 

We are currently investigating several questions. First, we are finding the 
minimax rate for this problem to establish whether or not our proposed 
method is optimal. Also, Klemela (2005) has derived mode estimation pro- 
cedures that adapt to the local regularity of the mode. It would be interesting 
to derive simialr adaptive theory for ridges. Second, the hidden manifold case 
required that the manifold has positive reach. We are working on relaxing 
this condition to allow for corners and intersections (often known as strat- 
ified spaces). Third, we are developing an extension where ridges of each 
dimension d = 0, 1, . . . are found sequentially and removed one at a time. 
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Figure 8. Estimated hyper-ridge set (red curve) from data generated from a circular mani- 
fold M (blue curve) of radius 3. The sample size is 1000, using Normal noise with a = 0.5. 
The estimate is computed from a kernel density estimator using the Silverman Normal ref- 
erence rule for the bandwidth. The starting points for the modified SCMS algorithm are 
taken the evaluation points of the density estimator excluding the points below 25% of the 
maximum estimated density. 
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Figure 9. Estimated hyper-ridge set (red curve) from data generated from a circular 
manifold M (blue curve) of radius 3. The sample size ts 20000, using Normal noise with 
a = 0.5. The estimate is computed from a kernel density estimator using the Silverman 
Normal reference rule for the bandwidth. The starting points for the modified SCMS al- 
gorithm are taken the evaluation points of the density estimator excluding the points below 
25% of the maximum estimated density. 



34 GENOVESE, PERONE-PACIFICO, VERDINELLI, WASSERMAN 




Figure 10. Effect of decreasing bandwidth. The data are iid samples from the same man- 
ifold as m the previous figure. Eventually we reach a phase transition where the structure 
of the estimator falls apart. 
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Figure 11. Data generated from a stylized "cosmic web" consisting of intersecting line 
segments and a uniform background clutter. Total sample size is 10000. The starting points 
for the modified SCMS algorithm are taken the evaluation points of the density estimator 
excluding the points below 5% of the maximum estimated density. 
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This leads to a decomposition of the point cloud into structures of increas- 
ing dimension. Finally, there are a number of methods for speeding up the 
mean shift algorithm. We are investigating how to adapt these speedups for 
SCMS. 



Appendix: Proof of Lemma 10. The gradient is 



(61) ga 


(I) 








- z)dW{z) 


and the Hessian is 














i 


(- 


(x — z){x 




(p^{x - z)dW{z) 


M 






("2) 


Pa{ 


x)I 


Jm 


- z)ix - 


- z)'^(l)a(x - z)dW{z) 



We can partition Mg- into disjoint fibers. Choose an x € Mo- and let x be 
the unique projection of x onto M. Let B = B{x,Ka-). For any bounded 
function f{x, z), 

(63) / f{x,z)cp„{x-z)dW{z)<—^ ^ W{B'')<Ca^. 



'MnB'= 



(2^)- 



Let T = T^M denote the d-dimensional tangent space at x and let T-*- 
denote the {D — (i)-dimensional normal space. For z & B D M, let z be the 
projection of z onto T. Then 

(64) x — z = (x — x) + {x — z) + {z — z) = dM{x)u + {x — z) + R 

where u = {x — x)/dM{x) G T"*- and R = {z — z). (Recall that du is the 
distance function; see (7).) For small enough a, the map h taking z to z 
is a bijection B D M and so the distribution W induces a distribution W, 
that is, VF(A) = W{h~^{A)). Let w denote the density of W with respect 
to Lebesgue measure fid on T. The density is bounded above and below and 
has two continuous derivatives. 

Lemma 12. For every x £ R(B a, sup^g^ ||2: — z|| < cK^. 



Proof. Choose any z G B and let z be its projection onto T. Because the 
reach is k > 0, there exists a ball S{a, k) C such that a is in the plane 
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defined x, z and z, S'(a, k) is tangent to the manifold at x and 5(a, k) does 
not intersect M except at x. Consider the hne through z and z and let Za 
be the point where the line intersects 5(a, k). Now — 2|| > ||z — z|| and 
by elementary geometry, H-Za — z|| < CK^. □ 

Recall that = aa with < a < 1. Define the following quantities: 
^ = ^^zj_,+2 ' a = alog 

g-||«||V(2^') g-|h||V(2a2) 

(27r)^(7-D-<^ (27r)2c7<^ 

Va,B{x)= [ Mx- z)dW{z) B = B{x,K^) 

JMnB 

where u G M^"'' and w € M''. 
Lemma 13. We have that 

(65) = 0x(a;-x)(/)||(x-z)(l + O(5)). 

Proof. First note that, for all x E Ra, 

1 e-"'/2 1 I 

(66) ^D-d <M^-^)< .D^^D-d 
(27r) 2 cr (^2ir) 2 cr 

and so, (l)±{x — x) x cr^^^"'^) as cr ^ 0. Now, 

||x — zll^ = ||x — x|p + ||x — + ||z — zll^ + 2(x — X, z — z), 
we have that 

(j)^{x -z) = (l)±{x - x)(t)\\ {x - ^)e-ll^-^llV(2<^')e-(^-^'^-^)/^'. 

Now ||z - z|p = 0{K^) and \{x - x,z - z)\ < \\x - x\ \ \ \z - z\\ = OiaK"^) 
and so 

□ 
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Proof of the Lemma 10. 

1. From (63), Pa{x) = jj^^^^ (l)„{x - z)dW{z) + O(a^). Now 



^„{x - z)dW{z) = (1 + 0{a))(p^{x - x) / ^\\ [x - z)dW{z) 
MnB JMr\B 



= {1 + 0{a))<i)^{x -x) I (t)\\{x - z)w{z)d^,d{z) 

JTMnB 

= (1 + 0{a))Mx -x) _L_e-ll*llV2^(j + at)dUt) 

where A = {t = {z — x)/a : z G B}. The volume of A is 0(cj^+^) and 
^ ^ M"^ as C7 0. Also, w{x + at) = w{x) + 0{a). Hence, 

/ + at)dfid{t) = + 0{a)){l - 0(cj^+^)) 

J A 

and so 

Mx - z)dW{z) = 4>±{x- x){l + 0{a)){w{x) + 0{a)){l - 0(a^+^)) 



MnB 
and 

p.ix) = cP^ix - x){l + 0(5))(iZJ(x) + 0{a)){l - 0((7^+^)) + 0((7^) 
= </.x(x-S)(l + 0(5)). 

2. PaB{x). This follows since in part 1 we showed that PaB{x) = Pcr{x) + 

3. For the gradient we have 
-a'^gaix) = j {x - z)(j)„{x - z)dW{z) 

= 1 + II + III. 

Now, I = (x — x)pa{x) = {x — x)(j)±{x — x)(l + 0(a)) and 

11= / {^-z)(l)„{x-z)dW{z) + 0{a^) 
J MnB 



{l + 0{a))(l)±{x-x) / {x-z)(j)ii{x-z)dW{z) + 0{a^). 
J MnB 
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For some u between x and z we have 

r — ~^ 

X — z)(l)\\{x — z)dW{z) = a / (l)\\{x — z)dW{z) 

AinB ' JMnB 

x — z 

a 



4>u{x - z)w{z)dixd{z) 

h-^B) <^ 



where A = {t = (z - x)/a e h~^{B)}. Finally, 

III = / (z-z)(l)^{x-z)dWiz)+0{a^) = (l)±ix-x)0{K^)+0{a^) = 0{kI)<I)^{x-x). 
JhinB 

Hence, 

-ctV(^) = ix-x)p^ix)+0{a^)+Mx-x)0{K^^)=p^{x){{x-x)+0{K^)) 
and hence 

9a{x) = -^Pa[x) {{x -x) + 0[Kl)) . 
It follow from part 1 that \\g„{x)\ \ = Oia-^^-^^-^^). 

4. To find the eigenvalues, we first approximate the Hessian. Without loss of 
generality, we can rotate the coordinates so that T is spanned by ei, . . . , e^, 
T"*- is spanned by ed+i, . . . , and u = (0, . . . , 0, 1). Now, 

a^H^jx) _^ J{x-z){x-zf(l)„{x-z)dW{z) 
Pu{x) a"^ J (j)aix - z)dW{z) 



and 



[ {x-z)ix-zf(l)^ix-z)dW{z) = 0{a^). 
JMnB'^ 



Let Q = /jv/n_B(^ — z){x — z)'^cl}a-{x — z)dW(z). Then, from (64), we have 
Q = Qi + Q2 + Q3 + Q4 + Q5 + Qs where 

Qi = dli{x)uu^ f Mx-z)dWiz) ^- - / 

JAinB 



Q5 



AinB 



iMnB 



Q2 = 


1 > 


[x 


— z){x 




JMnB 






Q4 = 


1 ' 


[x 


— x){x 




JMnB 






Q6 = 


1 ' 


[x 


-z){z 




JMnB 
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First, we note that 

Qi = d\j{x)uu^ (l)±{x -x){l + 0{a)). 



Next, 



Q2= [ ix-z){x-zfMx-z)dW{z) 
JMnB 

= (1 + 0{a))(j)±{x -x) [ {x-z){x- zf(t>ii (x - z)dW{z) 

JMnB 



and 



/ {x — z){x — z)'^ (f)\\{x — z)dW {z) = I (x — z)(x — z)'^(f)\\{x — z)w{z)diJ,fi{z) 
JMnB A-i(B) 

= w{x) [ {x-z){x-zf(t)iiix-z)dndiz) + 0{K^). 

Jh-^B) 

Next, with t = {h, . . . , t^, 0, . . . , 0), 

f {x-z){x-zfh{x-z)dfid{z) = a^ / «^(27r)-'^/Vll*ll'/2d^d(i) 

Jh-^{B) JB 

= a''(^j tf{2T:)-''/\-\\'\\'l^dUt) - tf{2T^)-'"^e-\\'\\''^dUt)^ 



and so 



h 




Q2 = (l + 0(5))<^x(x-x)(7^ 



h 




A similar analysis on the remaining terms yields: 

Q^ = {l + 0{a))(t>^{x-x)0{Kl) 
Q4 = (1 + 0{a))<P^{x - x)0{aKl) 
Qr, = {l + 0{a))(t>^{x-x)0{aKl) 
Q^ = {l + 0{d))4>^{x-x)0{Kl). 
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Combining all the terms we have 

Q = (1 + 0{a))(f)±{x - x) ^(fM{x)uu^ + 

Hence, 



Id 




0{a 



+ 0{aKl). 



H,{x) = -(l+0(5))^ 



•• 


• 


























•• 


• 












•• 
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• • 







•• 
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1 ■ • 







•• 
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•• 


• 1 





•• 
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•• 
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-| dlji'x) 



a' 



V 

The result follows. 

5. This follows from part 4 and the Davis-Kahan theorem. 

6. From part 5, Lo-(x) = + E where = 



+ 0{a) 



and E = 



Odxd ^d,D-d 
^ OD-d,d lo-d ^ 

O (a). Hence, Ga{x) = L„{x)g^{x) = [L'^ + E)g^{x) and the result follows 
from parts 3 and 4. 

7. These follow from part 4. 

8. Now we turn to Let A = {x — z). We claim that 

H' = \ l\{^®I){I®A + A /) - ^^^(/ ® AA^)(vec(/) ® A^)] dW{'. 
(7^ J I J 

+ ^ y" <^,(A)(vec(/) A^)dWiz). 
To see this, note first that H = \Q — \A where 
(67) Q = j{x-z){x- zf(pa{x - z)dW{z) and A = Pa{x)I. 
Note that Q = j\x-z){x- zY^dW{z) where $ = (t)a{A)lD. So 
0' = /w<ix)|(x-.)(x-.f.l.^(.) 
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— Ux - z)(x - zY $ = ^ = . 

dx^^ ^ ' dx dA 

Now (d/da;)(AA^$) = (fg)' where / = AA^ and 5 = $ and so 

-f-(AA^$) = ($®/)-^(AA^) + (/® AA^)^$ 
dx dx dx 

= ($ (g) /)(/ (8) A + A ® /) - ^^^(I (8) AA^)(vec(/) A^). 



(7^ 



Hence 



Q' = y[($®/)(/®A + A®I)-^^^(/®AA^)(vec(I)® A^)]dVF(z). 
By a similar calculation, 

A' = -^ J 0a(A)(vec(I)® A^)dT^(z). 

Thus, 



H' = ^Q' - ^A' 



= ^ / ® /)(/ ® A + A ® /) - ^^^(/ «) AA^)(vec(/) ® A^)l dW{z) 
cr^ J I J 



+ 



^<,(A)(vec(/) ® A^)dl^(z). 



Each of these terms is of order 0{sup^^]^ \\w"{x)\\/a^ ''+^). Consider the 
first term 

^ J ^ ^ A)dW{z) = ^^{27r)^/^ J e-ll^-^ll'/(2'^')(7(8)A)dW^(z) 

where u = {x — z)/a. As in the proof of part 1, we can restrict to B fl M, 
do a change of measure to W and the term is dominated by 



'f/;(x) + w'{u)au 



dnd{t) 



C 



(^3+D-<i(27r)^/2 ■ 
The other terms may be bounded similarly. 
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